feature space
SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation
Few-shot segmentation aims to segment unseen categories from just a handful of annotated examples. This requires mechanisms to identify semantically related objects across images and accurately produce masks. We note that Segment Anything 2 (SAM2), with its prompt-and-propagate mechanism, provides strong segmentation capabilities and a built-in feature matching process. However, we show that its representations are entangled with task-specific cues optimized for object tracking, which impairs its use for tasks requiring higher-level semantic understanding. Our key insight is that, despite its class-agnostic pretraining, SAM2 already encodes rich semantic structure in its features. We propose SANSA (Semantically AligNed SegmentAnything 2), a framework that makes this latent structure explicit and repurposes SAM2 for few-shot segmentation through minimal task-specific modifications. SANSA achieves state-of-the-art performance on few-shot segmentation benchmarks designed to assess generalization and outperforms generalist methods in the popular in-context setting. Additionally, it supports flexible promptable interaction via points, boxes, or scribbles, and remains significantly faster and more compact than prior approaches.
Neural Collapse in Cumulative Link Models for Ordinal Regression: An Analysis with Unconstrained Feature Model
A phenomenon known as "Neural Collapse (NC)" in deep classification tasks, in which the penultimate-layer features and the final classifiers exhibit an extremely simple geometric structure, has recently attracted considerable attention, with the expectation that it can deepen our understanding of how deep neural networks behave. The Unconstrained Feature Model (UFM) has been proposed to explain NC theoretically, and a growing body of work extends NC to tasks other than classification and leverages it for practical applications. In this study, we investigate whether a similar phenomenon arises in deep Ordinal Regression (OR) tasks by combining the cumulative link model for OR with the UFM. We show that a phenomenon we call Ordinal Neural Collapse (ONC) indeed emerges and is characterized by the following three properties: (ONC1) all optimal features in the same class collapse to their within-class mean when regularization is applied; (ONC2) these class means align with the classifier, meaning that they collapse onto a one-dimensional subspace; (ONC3) the optimal latent variables (corresponding to logits or preactivations in classification tasks) are aligned according to the class order, and in particular, in the zero-regularization limit, a highly local and simple geometric relationship emerges between the latent variables and the threshold values. We prove these properties analytically within the UFM framework with fixed threshold values and corroborate them empirically across a variety of datasets. We also discuss how these insights can be leveraged in OR, highlighting the use of fixed thresholds.
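As a rough illustration of the cumulative link model mentioned above, the sketch below maps a scalar latent variable and a set of fixed, increasing thresholds to ordinal class probabilities. The logistic link, the sign convention P(y <= k | z) = sigmoid(b_k - z), and the function names `clm_class_probs` and `predict` are assumptions chosen for illustration, not details taken from the paper.

```python
import math


def sigmoid(x):
    """Logistic function (assumed link; other links are possible)."""
    return 1.0 / (1.0 + math.exp(-x))


def clm_class_probs(z, thresholds):
    """Class probabilities for a latent value z under a logistic CLM.

    thresholds: increasing list b_1 < ... < b_{K-1} (held fixed here).
    Assumed convention: P(y <= k | z) = sigmoid(b_k - z).
    Returns a list of K probabilities, one per ordered class.
    """
    cumulative = [sigmoid(b - z) for b in thresholds] + [1.0]
    probs = []
    previous = 0.0
    for c in cumulative:
        # P(y = k) is the difference of consecutive cumulative probabilities.
        probs.append(c - previous)
        previous = c
    return probs


def predict(z, thresholds):
    """Index of the most probable class for latent value z."""
    probs = clm_class_probs(z, thresholds)
    return max(range(len(probs)), key=probs.__getitem__)
```

With thresholds `[-1.0, 0.0, 1.0]` there are four ordered classes; a strongly negative latent value favors the first class and a strongly positive one favors the last, which is the ordering behavior that ONC3 refers to.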
Reconstructing Heterogeneous Biomolecules via Hierarchical Gaussian Mixtures and Part Discovery
Cryo-EM is a transformational paradigm in molecular biology where computational methods are used to infer 3D molecular structure at atomic resolution from extremely noisy 2D electron microscope images. At the forefront of research is how to model the structure when the imaged particles exhibit non-rigid conformational flexibility and compositional variation where parts are sometimes missing. We introduce a novel 3D reconstruction framework with a hierarchical Gaussian mixture model, inspired in part by Gaussian Splatting for 4D scene reconstruction. In particular, the structure of the model is grounded in an initial process that infers a part-based segmentation of the particle, providing essential inductive bias in order to handle both conformational and compositional variability. The framework, called CryoSPIRE, is shown to reveal biologically meaningful structures on complex experimental datasets, and establishes a new state-of-the-art on CryoBench, a benchmark for cryo-EM heterogeneity methods.
Enhancing Interpretability in Deep Reinforcement Learning through Semantic Clustering
In this paper, we explore semantic clustering properties of deep reinforcement learning (DRL) to improve its interpretability and deepen our understanding of its internal semantic organization. In this context, semantic clustering refers to the ability of neural networks to cluster inputs based on their semantic similarity in the feature space. We propose a DRL architecture that incorporates a novel semantic clustering module that combines feature dimensionality reduction with online clustering.
Long-Tailed Recognition via Information-Preservable Two-Stage Learning
Imbalance (a long tail) is inherent to many real-world data distributions and often biases deep classification models toward frequent classes, resulting in poor performance on tail classes. In this paper, we propose a novel two-stage learning approach to mitigate this majority-biased tendency while preserving valuable information within datasets. Specifically, the first stage proposes a new representation learning technique from an information-theoretic perspective. This approach is theoretically equivalent to minimizing intraclass distance, yielding an effective and well-separated feature space. The second stage develops a novel sampling strategy that selects mathematically informative instances, which can rectify majority-biased decision boundaries without compromising a model's overall performance. As a result, our approach achieves state-of-the-art performance across various long-tailed benchmark datasets.
Conditional Representation Learning for Customized Tasks
Conventional representation learning methods learn a universal representation that primarily captures dominant semantics, which may not always align with customized downstream tasks. For instance, in animal habitat analysis, researchers prioritize scene-related features, whereas universal embeddings emphasize categorical semantics, leading to suboptimal results. As a solution, existing approaches resort to supervised fine-tuning, which, however, incurs high computational and annotation costs. In this paper, we propose Conditional Representation Learning (CRL), which aims to extract representations tailored to arbitrary user-specified criteria. Specifically, we reveal that the semantics of a space are determined by its basis, so that a set of descriptive words can approximate the basis of a customized feature space. Building upon this insight, given a user-specified criterion, CRL first employs a large language model (LLM) to generate descriptive texts to construct the semantic basis, then projects the image representation into this conditional feature space using a vision-language model (VLM). The conditional representation better captures semantics for the specific criterion and can be used for multiple customized tasks. Extensive experiments on classification and retrieval tasks demonstrate the superiority and generality of the proposed CRL.
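The projection step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the image embedding and the embeddings of the descriptive texts are taken as already computed by some VLM (plain lists stand in for them here), and the projection is assumed to be cosine similarity against each text embedding. The paper's actual projection may differ, and the name `conditional_representation` is hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def conditional_representation(image_emb, text_embs):
    """Project an image embedding onto a text-defined basis.

    image_emb: image embedding from a VLM (assumed precomputed).
    text_embs: embeddings of the descriptive texts for one criterion
               (assumed precomputed, same dimensionality as image_emb).
    Returns one cosine similarity per descriptive text; this vector is
    the representation in the criterion-specific space.
    """
    image_unit = normalize(image_emb)
    representation = []
    for text_emb in text_embs:
        text_unit = normalize(text_emb)
        representation.append(sum(a * b for a, b in zip(image_unit, text_unit)))
    return representation
```

For example, with a toy image embedding `[1.0, 0.0]` and three toy text embeddings `[1.0, 0.0]`, `[0.0, 1.0]`, and `[1.0, 1.0]`, the result has one coordinate per descriptive text, so changing the criterion (and hence the texts) changes the space the same image is described in.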
GOOD: Training-Free Guided Diffusion Sampling for Out-of-Distribution Detection
Recent advancements have explored text-to-image diffusion models for synthesizing out-of-distribution (OOD) samples, substantially enhancing the performance of OOD detection. However, existing approaches typically rely on perturbing text-conditioned embeddings, resulting in semantic instability and insufficient shift diversity, which limit generalization to realistic OOD. To address these challenges, we propose GOOD, a novel and flexible framework that directly guides diffusion sampling trajectories towards OOD regions using off-the-shelf in-distribution (ID) classifiers. GOOD incorporates dual-level guidance: (1) Image-level guidance, based on the gradient of the log partition to reduce input likelihood, drives samples toward low-density regions in pixel space.
Domain Adaptive Hashing Retrieval via VLM Assisted Pseudo-Labeling and Dual Space Adaptation
Unsupervised domain adaptive hashing has emerged as a promising approach for efficient and memory-friendly cross-domain retrieval. It leverages the model learned on labeled source domains to generate compact binary codes for unlabeled target domain samples, ensuring that semantically similar samples are mapped to nearby points in the Hamming space. Existing methods typically apply domain adaptation techniques to the feature space or the Hamming space, especially pseudo-labeling and feature alignment. However, the inherent noise of pseudo-labels and the insufficient exploration of complementary knowledge across spaces hinder the ability of the adapted model. To address these challenges, we propose a Vision-language model assisted Pseudo-labeling and Dual Space adaptation (VPDS) method.
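To make the retrieval setting above concrete, the sketch below binarizes real-valued features into codes and ranks database items by Hamming distance. It illustrates generic hashing retrieval only, not the VPDS method; thresholding at zero and the function names `binarize`, `hamming`, and `retrieve` are assumptions for illustration.

```python
def binarize(features):
    """Turn real-valued features into a binary code (assumed rule: x > 0)."""
    return [1 if x > 0 else 0 for x in features]


def hamming(code_a, code_b):
    """Number of positions at which two equal-length codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


def retrieve(query_code, database_codes, k=1):
    """Indices of the k database codes closest to the query in Hamming space.

    Ties keep database order because Python's sort is stable.
    """
    order = sorted(
        range(len(database_codes)),
        key=lambda i: hamming(query_code, database_codes[i]),
    )
    return order[:k]
```

Because distances are computed on short binary codes rather than on floating-point features, both storage and comparison are cheap, which is the efficiency argument made for hashing-based retrieval.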
Beyond Scalars: Concept-Based Alignment Analysis in Vision Transformers
Measuring the alignment between representations lets us understand similarities between the feature spaces of different models, such as Vision Transformers trained under diverse paradigms. However, traditional measures for representational alignment yield only scalar values that obscure how these spaces agree in terms of learned features. To address this, we combine alignment analysis with concept discovery, allowing a fine-grained breakdown of alignment into individual concepts. This approach reveals both universal concepts across models and each representation's internal concept structure. We introduce a new definition of concepts as non-linear manifolds, hypothesizing they better capture the geometry of the feature space. A sanity check demonstrates the advantage of this manifold-based definition over linear baselines for concept-based alignment. Finally, our alignment analysis of four different ViTs shows that increased supervision tends to reduce semantic organization in learned representations.
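As an example of the scalar alignment measures discussed above, the sketch below computes linear centered kernel alignment (CKA) between two feature matrices evaluated on the same inputs. Linear CKA is chosen here as one commonly used measure; the abstract does not say which measures the authors consider, and the helper names are hypothetical.

```python
import math


def center_columns(matrix):
    """Subtract the column mean from every entry (rows are samples)."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    return [[row[j] - means[j] for j in range(n_cols)] for row in matrix]


def cross_frobenius_sq(X, Y):
    """Squared Frobenius norm of X^T Y for row-aligned matrices X and Y."""
    total = 0.0
    for i in range(len(X[0])):
        for j in range(len(Y[0])):
            entry = sum(X[r][i] * Y[r][j] for r in range(len(X)))
            total += entry * entry
    return total


def linear_cka(X, Y):
    """Linear CKA between two representations of the same n samples.

    X: n x d1 features from one model, Y: n x d2 features from another.
    Returns a single scalar in [0, 1]; 1 means the representations match
    up to rotation and isotropic scaling.
    """
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_frobenius_sq(Xc, Yc)
    denominator = math.sqrt(cross_frobenius_sq(Xc, Xc) * cross_frobenius_sq(Yc, Yc))
    return numerator / denominator
```

The output is one number per pair of representations, which illustrates the limitation the abstract points out: the score says how much two feature spaces agree, but not which learned concepts they agree on.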